Abstract

We study properties of popular near-uniform (Dirichlet) priors for learning undersampled probability
distributions on discrete nonmetric spaces and show that they lead to disastrous results. However,
an Occam-style phase space argument expands the priors into their infinite mixture and resolves most
of the observed problems. This leads to a surprisingly good estimator of entropies of discrete
distributions.

Learning a probability distribution from examples is one of the basic problems in data analysis. 
Common practical approaches introduce a family of parametric models, leading to questions about model 
selection. In Bayesian inference, computing the total probability of the data arising from a model involves 
an integration over parameter space, and the resulting "phase space volume" automatically discriminates 
against models with larger numbers of parameters, whence the description of these volume terms as
Occam factors [1, ?]. As we move from finite parameterizations to models that are described by smooth
functions, the integrals over parameter space become functional integrals and methods from quantum
field theory allow us to do these integrals asymptotically; again the volume in model space consistent
with the data is larger for models that are smoother and hence less complex [?]. Further, at least under
some conditions the relevant degree of smoothness can be determined self-consistently from the data, so
that we approach something like a model independent method for learning a distribution [?].

The results emphasizing the importance of phase space factors in learning prompt us to look back at 
a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here 
the probability distribution is just a list of numbers $\{q_i\}$, $i = 1, 2, \ldots, K$, where $K$ is the number of
bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to
believe that any $q_i$ and $q_j$ should be similar. The task is to learn this distribution from a set of examples,
which we can describe as the number of times $n_i$ each possibility is observed in a set of $N = \sum_{i=1}^{K} n_i$
samples. This problem arises in the context of language, where the index i might label words or phrases, 
so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about 
similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i 
might label n-mers of the DNA or amino acid sequence, and although most work in the field is based
on metrics for sequence comparison one might like an alternative approach that does not rest on such 
assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes 
a set of discrete "words," and estimates of the information content in the response are determined by the
probability distribution on this discrete space. What all of these examples have in common is that we
often need to draw some conclusions with data sets that are not in the asymptotic limit $N \gg K$. Thus,
while we might use a large corpus to sample the distribution of words in English by brute force (reaching
$N \gg K$ with $K$ the size of the vocabulary), we can hardly do the same for three or four word phrases.

In models described by continuous functions, the infinite number of "possibilities" can never be 
overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of 
this notion that we can apply in the discrete case? Our intuition is that information theoretic quantities 
may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution 
would be one which does not have too much mutual information between these variables. Even more 
simply, we might say that smooth distributions have large entropy. While the idea of "maximum entropy 
inference" is common [?], the interplay between constraints on the entropy and the volume in the space
of models seems not to have been considered. As we shall explain, phase space factors alone imply
that seemingly sensible, more or less uniform priors on the space of discrete probability distributions
correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution.
We argue that reliable inference outside the asymptotic regime $N \gg K$ requires a more uniform prior
on the entropy, and we offer one way of doing this. While many distributions are consistent with the data
when $N < K$, we provide empirical evidence that this flattening of the entropic prior allows us to make
surprisingly reliable statements about the entropy itself in this regime.

At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform 
priors on the space of distributions. The natural "uniform" prior is given by

$$\mathcal{P}_u(\{q_i\}) = \frac{1}{Z_u}\,\delta\Big(1 - \sum_{i=1}^{K} q_i\Big), \qquad Z_u = \int_A dq_1 \cdots dq_K\,\delta\Big(1 - \sum_{i=1}^{K} q_i\Big), \tag{1}$$

where the delta function imposes the normalization, $Z_u$ is the total volume in the space of models, and
the integration domain $A$ is such that each $q_i$ varies in the range [0, 1]. Note that, because of the
normalization constraint, an individual $q_i$ chosen from this distribution in fact is not uniformly distributed; this
is also an example of phase space effects, since in choosing one $q_i$ we constrain all the other $\{q_{j \neq i}\}$.
What we mean by uniformity is that all distributions that obey the normalization constraint are equally 
likely a priori. 

Inference with this uniform prior is straightforward. If our examples come independently from $\{q_i\}$,
then we calculate the probability of the model $\{q_i\}$ with the usual Bayes rule:¹

$$P(\{q_i\}|\{n_i\}) = \frac{P(\{n_i\}|\{q_i\})\,\mathcal{P}_u(\{q_i\})}{P_u(\{n_i\})}, \qquad P(\{n_i\}|\{q_i\}) = \prod_{i=1}^{K} (q_i)^{n_i}. \tag{2}$$

If we want the best estimate of the probability $q_i$ in the least squares sense, then we should compute the
conditional mean, and this can be done exactly, so that [?]

$$\langle q_i \rangle = \frac{n_i + 1}{N + K}. \tag{3}$$

Thus we can think of inference with this uniform prior as setting probabilities equal to the observed 
frequencies, but with an "extra count" in every bin. This sensible procedure was first introduced by 
Laplace [?]. It has the desirable property that events which have not been observed are not automatically
assigned probability zero. 

¹If the data are unordered, extra combinatorial factors have to be included in $P(\{n_i\}|\{q_i\})$. However, these cancel immediately
in later expressions.

A natural generalization of these ideas is to consider priors that have a power-law dependence on the
probabilities, the so-called Dirichlet family of priors:

$$\mathcal{P}_\beta(\{q_i\}) = \frac{1}{Z(\beta)}\,\delta\Big(1 - \sum_{i=1}^{K} q_i\Big) \prod_{i=1}^{K} q_i^{\beta - 1}. \tag{4}$$

It is interesting to see what typical distributions from these priors look like. Even though different
$q_i$'s are not independent random variables due to the normalizing $\delta$-function, generation of random
distributions is still easy: one can show that if $q_i$'s are generated successively (starting from $i = 1$ and
proceeding up to $i = K$) from the Beta-distribution

$$P(q_i) = B\Big(\frac{q_i}{1 - \sum_{j<i} q_j};\ \beta,\ (K - i)\beta\Big), \qquad B(x; a, b) = \frac{x^{a-1}(1 - x)^{b-1}}{B(a, b)}, \tag{5}$$

then the probability of the whole sequence $\{q_i\}$ is $\mathcal{P}_\beta(\{q_i\})$. Fig. 1 shows some typical
distributions generated this way. They represent different regions of the range of possible entropies:
low entropy (about 1 bit, where only a few bins have observable probabilities), entropy in the middle of
the possible range, and entropy in the vicinity of the maximum, $\log_2 K$. When learning an unknown
distribution, we usually have no a priori reason to expect it to look like only one of these possibilities,
but choosing $\beta$ pretty much fixes allowed "shapes." This will be a focal point of our discussion.
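
To get a feel for how strongly $\beta$ fixes the "shape," one can draw a few distributions and look at their entropies. The following is a minimal sketch, not the authors' code: it swaps the successive Beta construction for the equivalent recipe of normalizing independent Gamma($\beta$, 1) variates, and the function names `sample_dirichlet` and `entropy_bits` are hypothetical.

```python
import math
import random

def sample_dirichlet(K, beta, rng):
    """Draw one distribution {q_i} from the symmetric Dirichlet prior.

    Assumption: independent Gamma(beta, 1) variates divided by their sum
    are used here; this is equivalent in distribution to the successive
    Beta construction, but it is not the procedure spelled out in the text.
    """
    g = [rng.gammavariate(beta, 1.0) for _ in range(K)]
    total = sum(g)
    return [x / total for x in g]

def entropy_bits(q):
    """Shannon entropy in bits; bins with q_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in q if p > 0.0)

if __name__ == "__main__":
    rng = random.Random(0)
    K = 1000
    for beta in (0.001, 0.02, 1.0):
        draws = [entropy_bits(sample_dirichlet(K, beta, rng)) for _ in range(20)]
        mean = sum(draws) / len(draws)
        print(f"beta={beta}: mean entropy of 20 draws = {mean:.2f} bits "
              f"(maximum {math.log2(K):.2f})")
```

Running the loop for several values of $\beta$ at fixed $K$ gives a quick view of how narrow the spread of entropies is at each $\beta$.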

Even though distributions look different, inference
with all priors Eq. (4) is similar [?]:

$$\langle q_i \rangle_\beta = \frac{n_i + \beta}{N + \kappa}, \qquad \kappa = K\beta. \tag{6}$$

This simple modification of Laplace's rule, Eq. (3), which allows us to vary the probability assigned to the
outcomes not yet seen, was first examined by Hardy and Lidstone [?]. Together with Laplace's
formula, $\beta = 1$, this family includes the usual maximum likelihood estimator (MLE), $\beta \to 0$, that
identifies probabilities with frequencies, as well as the Jeffreys' or Krichevsky-Trofimov (KT) estimator,
$\beta = 1/2$ [?, 13], the Schürmann-Grassberger (SG) estimator, $\beta = 1/K$ [14], and other popular
choices.

Figure 1: Typical distributions, $K = 1000$.
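
The whole family amounts to one line of arithmetic, $\langle q_i \rangle_\beta = (n_i + \beta)/(N + K\beta)$. The sketch below is illustrative only; the function name `add_beta_estimate` and the example counts are made up.

```python
def add_beta_estimate(counts, beta):
    """Posterior mean of q_i under a symmetric Dirichlet prior:
    (n_i + beta) / (N + K * beta), i.e. `beta` extra counts in every bin.

    beta = 1   -> Laplace's rule
    beta = 1/2 -> Jeffreys / Krichevsky-Trofimov
    beta = 1/K -> Schurmann-Grassberger
    beta = 0   -> maximum likelihood (observed frequencies; needs N > 0)
    """
    N = sum(counts)
    K = len(counts)
    kappa = K * beta
    return [(n + beta) / (N + kappa) for n in counts]

if __name__ == "__main__":
    counts = [3, 1, 0, 0]          # hypothetical data: N = 4 samples, K = 4 bins
    K = len(counts)
    for name, beta in [("MLE", 0.0), ("SG", 1.0 / K), ("KT", 0.5), ("Laplace", 1.0)]:
        print(name, [round(q, 3) for q in add_beta_estimate(counts, beta)])
```

Larger $\beta$ moves more probability onto the unseen bins, which is the only difference between these estimators.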

To understand why inference in the family of priors defined by Eq. (4) is unreliable, consider the
entropy of a distribution drawn at random from this ensemble. Ideally we would like to compute this
whole a priori distribution of entropies,

$$\mathcal{P}_\beta(S) = \int dq_1\,dq_2 \cdots dq_K\; \mathcal{P}_\beta(\{q_i\})\;\delta\Big[S + \sum_{i=1}^{K} q_i \log_2 q_i\Big], \tag{7}$$

but this is quite difficult. However, as noted by Wolpert and Wolf [?], one can compute the moments of
$\mathcal{P}_\beta(S)$ rather easily. Transcribing their results to the present notation (and correcting some small errors),
we find:

$$\xi(\beta) \equiv \langle S[n_i = 0] \rangle_\beta = \psi_0(\kappa + 1) - \psi_0(\beta + 1), \tag{8}$$

$$\sigma^2(\beta) \equiv \langle (\delta S)^2[n_i = 0] \rangle_\beta = \frac{\beta + 1}{\kappa + 1}\,\psi_1(\beta + 1) - \psi_1(\kappa + 1), \tag{9}$$

where $\kappa = K\beta$, and $\psi_m(x) = (d/dx)^{m+1} \log_2 \Gamma(x)$ are the polygamma functions.

Figure 2: $\xi(\beta)/\log_2 K$ and $\sigma(\beta)$ as functions of $\beta$ and $K$; gray bands are the region
of $\pm\sigma(\beta)$ around the mean. Note the transition from the logarithmic to the linear scale
at $\beta = 0.25$ in the insert.

This behavior of the moments is shown in Fig. 2. We are faced with a striking observation: a priori
distributions of entropies in the power-law priors are extremely peaked for even moderately large $K$.
Indeed, as a simple analysis shows, their maximum standard deviation of approximately 0.61 bits is
attained at $\beta \approx 1/K$, where $\xi(\beta) \approx 1/\ln 2$ bits. This has to be compared with the
possible range of entropies, $[0, \log_2 K]$, which is asymptotically large with $K$. Even worse, for any fixed
$\beta$ and sufficiently large $K$, $\xi(\beta) = \log_2 K - O(K^0)$, and $\sigma(\beta) \propto 1/\sqrt{\kappa}$.
Similarly, if $K$ is large, but $\kappa$ is small, then $\xi(\beta) \propto \kappa$, and $\sigma(\beta) \propto \sqrt{\kappa}$.
This paints a lively picture: varying $\beta$ between 0 and $\infty$ results in a smooth variation of $\xi$, the
a priori expectation of the entropy, from 0 to $S_{\max} = \log_2 K$. Moreover, for large $K$, the standard
deviation of $\mathcal{P}_\beta(S)$ is always negligible relative to the possible range of entropies, and it is
negligible even absolutely for $\xi \gg 1$ ($\beta \gg 1/K$). Thus a seemingly innocent choice of the prior,
Eq. (4), leads to a disaster: fixing $\beta$ specifies the entropy almost uniquely. Furthermore, the situation
persists even after we observe some data: until the distribution is well sampled, our estimate of the
entropy is dominated by the prior!
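
The peakedness is easy to check numerically. The sketch below assumes the a priori moments $\xi(\beta) = \psi_0(\kappa+1) - \psi_0(\beta+1)$ and $\sigma^2(\beta) = \frac{\beta+1}{\kappa+1}\psi_1(\beta+1) - \psi_1(\kappa+1)$ with $\kappa = K\beta$. As a further assumption, it evaluates them with natural-logarithm polygamma functions and converts to bits by dividing by $\ln 2$. The helpers `digamma` and `trigamma` are simple stand-ins (recurrence plus asymptotic series), not a library API.

```python
import math

def digamma(x):
    """psi_0(x) for x > 0: shift x upward, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """psi_1(x) for x > 0: shift x upward, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def prior_entropy_moments_bits(K, beta):
    """A priori mean and standard deviation of the entropy, in bits."""
    kappa = K * beta
    xi = digamma(kappa + 1.0) - digamma(beta + 1.0)
    var = (beta + 1.0) / (kappa + 1.0) * trigamma(beta + 1.0) - trigamma(kappa + 1.0)
    ln2 = math.log(2.0)
    return xi / ln2, math.sqrt(max(var, 0.0)) / ln2

if __name__ == "__main__":
    K = 1000
    for beta in (1.0 / K, 0.02, 0.5, 1.0):
        xi, sigma = prior_entropy_moments_bits(K, beta)
        print(f"beta={beta:g}: xi={xi:.3f} bits, sigma={sigma:.3f} bits "
              f"(range 0..{math.log2(K):.2f})")
```

Under these assumptions the width at $\beta = 1/K$ comes out close to the 0.61 bits quoted above, and it shrinks rapidly for larger $\beta$.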

Thus it is clear that all commonly used estimators mentioned above have a problem. While they may
or may not provide a reliable estimate of the distribution $\{q_i\}$, they are definitely a poor tool to learn
entropies. Unfortunately, often we are interested precisely in these entropies or similar
information-theoretic quantities, as in the examples (neural code, language, and bioinformatics) we briefly
mentioned earlier.

Are the usual estimators really this bad?² Consider this: for the MLE ($\beta = 0$), Eqs. (8) and (9) are formally
wrong since it is impossible to normalize $\mathcal{P}_0(\{q_i\})$. However, the prediction that $\mathcal{P}_0(S) = \delta(S)$ still
holds. Indeed, $S_{ML}$, the entropy of the ML distribution, is zero even for $N = 1$, let alone for $N = 0$.
In general, it is well known that $S_{ML}$ always underestimates the actual value of the entropy, and the
correction

$$S = S_{ML} + \frac{K^*}{2N} + O\Big(\frac{1}{N^2}\Big) \tag{10}$$

is usually used (cf. [?]). Here we must set $K^* = K - 1$ to have an asymptotically correct result.
Unfortunately in an undersampled regime, $N \ll K$, this is a disaster. To alleviate the problem, different
authors suggested to determine the dependence $K^* = K^*(K)$ by various (rather ad hoc) empirical [?]
or pseudo-Bayesian techniques [16]. However, then there is no principled way to estimate both the
residual bias and the error of the estimator.

²In any case, the answer to this question depends mostly on the "metric" chosen to measure reliability. Minimization of bias,
variance, or information cost (Kullback-Leibler divergence between the target distribution and the estimate) leads to very different
"best" estimators.
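
A short numerical illustration of why this correction fails when $N \ll K$. It is a sketch under one stated assumption: the $K^*/(2N)$ term is taken in nats and divided by $\ln 2$ to express it in bits. The function names and the example counts are hypothetical.

```python
import math

def ml_entropy_bits(counts):
    """Entropy (bits) of the maximum likelihood distribution n_i / N."""
    N = sum(counts)
    return -sum((n / N) * math.log2(n / N) for n in counts if n > 0)

def corrected_entropy_bits(counts, K_star):
    """S_ML plus the first-order bias correction K*/(2N).

    Assumption: the correction is in nats, so it is divided by ln 2 here.
    """
    N = sum(counts)
    return ml_entropy_bits(counts) + K_star / (2.0 * N * math.log(2.0))

if __name__ == "__main__":
    K = 1000
    # Hypothetical undersampled data: 30 samples, all in different bins.
    counts = [1] * 30 + [0] * (K - 30)
    print("S_ML            :", round(ml_entropy_bits(counts), 2), "bits")
    print("corrected, K-1  :", round(corrected_entropy_bits(counts, K - 1), 2), "bits")
    print("maximum possible:", round(math.log2(K), 2), "bits")
```

With these made-up counts the uncorrected value cannot exceed $\log_2 N$, while the correction with $K^* = K - 1$ overshoots the maximum possible entropy $\log_2 K$.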

The situation is even worse for Laplace's rule, $\beta = 1$. We were unable to find any results in the
literature that would show a clear understanding of the effects of the prior on the entropy estimate, $S_L$.
And these effects are enormous: the a priori distribution of the entropy has $\sigma(1) \sim 1/\sqrt{K}$ and is almost
$\delta$-like. This translates into a very certain, but nonetheless possibly wrong, estimate of the entropy. We
believe that this type of error (cf. Fig. 3) has been overlooked in some previous literature.

The Schürmann-Grassberger estimator, $\beta = 1/K$, deserves special attention. The variance of
$\mathcal{P}_\beta(S)$ is maximized near this value of $\beta$ (cf. Fig. 2). Thus the SG estimator results in the most uniform a
priori expectation of $S$ possible for the power-law priors, and consequently in the least bias. We suspect
that this feature is responsible for a remark in Ref. [14] that this $\beta$ was empirically the best for studying
printed texts. But even the SG estimator is flawed: it is biased towards (roughly) $1/\ln 2$, and it is still a
priori rather narrow.

Summarizing, we conclude that simple power-law priors, Eq. (4), must not be used to learn entropies
when there is no strong a priori knowledge to back them up. On the other hand, they are the only priors
we know of that allow to calculate $\langle q_i \rangle$, $\langle S \rangle$, $\langle \chi^2 \rangle$, ... exactly [?].
Is there a way to resolve the problem of peakedness of $\mathcal{P}_\beta(S)$ without throwing away their
analytical ease? One approach would be to use

$$\mathcal{P}^{\rm flat}_\beta(\{q_i\}) = \frac{\mathcal{P}_\beta(\{q_i\})}{\mathcal{P}_\beta(S[q_i])}\,\mathcal{P}^{\rm actual}(S[q_i])$$

as a prior on $\{q_i\}$. This has a feature that the a priori distribution of $S$ deviates from uniformity only
due to our actual knowledge $\mathcal{P}^{\rm actual}(S[q_i])$, but not in the way $\mathcal{P}_\beta(S)$ does.
However, as we already mentioned, $\mathcal{P}_\beta(S[q_i])$ is yet to be calculated.

Another way to a flat prior is to write $\mathcal{P}(S) = 1 = \int \delta(S - \xi)\,d\xi$. If we find a family of priors
$\mathcal{P}(\{q_i\}, \text{parameters})$ that result in a $\delta$-function over $S$, and if changing the parameters
moves the peak across the whole range of entropies uniformly, we may be able to use this. Luckily,
$\mathcal{P}_\beta(S)$ is almost a $\delta$-function!³ In addition, changing $\beta$ results in changing
$\xi(\beta) = \langle S[n_i = 0] \rangle_\beta$ across the whole range $[0, \log_2 K]$. So we may hope that the prior⁴

$$\mathcal{P}(\{q_i\}; \beta) = \frac{1}{Z}\,\delta\Big(1 - \sum_{i=1}^{K} q_i\Big) \prod_{i=1}^{K} q_i^{\beta - 1}\,\frac{d\xi(\beta)}{d\beta}\,\mathcal{P}(\beta) \tag{11}$$

may do the trick and estimate entropy reliably even for small $N$, and even for distributions that are
atypical for any one $\beta$. We have less reason, however, to expect that this will give an equally reliable
estimator of the atypical distributions themselves. Note the term $d\xi/d\beta$ in Eq. (11). It is there because
$\xi$, not $\beta$, measures the position of the entropy density peak.

Figure 3: Learning the $\beta = 0.02$ distribution from Fig. 1 with $\beta = 0.001, 0.02, 1$. The
actual error of the estimators is plotted; the error bars are the standard deviations of the
posteriors. The "wrong" estimators are very certain but nonetheless incorrect.

³The approximation becomes not so good as $\beta \to 0$ since $\sigma(\beta)$ becomes $O(1)$ before dropping to zero. Even worse, $\mathcal{P}_\beta(S)$
is skewed at small $\beta$. This accumulates an extra weight at $S = 0$. Our approach to dealing with these problems is to ignore them
while the posterior integrals are dominated by $\beta$'s that are far away from zero. This was always the case in our simulations, but is
an open question for the analysis of real data.

⁴Priors that are formed as weighted sums of the different members of the Dirichlet family are usually called Dirichlet mixture
priors. They have been used to estimate probability distributions of, for example, protein sequences [17]. Equation (11), an infinite
mixture, is a further generalization, and, to our knowledge, it has not been studied before.

Inference with the prior, Eq. (11), involves additional averaging over $\beta$ (or, equivalently, $\xi$), but is
nevertheless straightforward. The a posteriori moments of the entropy are

$$\widehat{S^m} = \frac{\int d\xi\, \rho(\xi, [n_i])\, \langle S^m[n_i] \rangle_{\beta(\xi)}}{\int d\xi\, \rho(\xi, [n_i])}, \tag{12}$$

where the posterior density is

$$\rho(\xi, [n_i]) = \mathcal{P}(\beta(\xi))\,\frac{\Gamma(\kappa(\xi))}{\Gamma(N + \kappa(\xi))} \prod_{i=1}^{K} \frac{\Gamma(n_i + \beta(\xi))}{\Gamma(\beta(\xi))}. \tag{13}$$

Here the moments $\langle S^m[n_i] \rangle_{\beta(\xi)}$ are calculated at fixed $\beta$ according to the (corrected) formulas of
Wolpert and Wolf [?]. We can view this inference scheme as follows: first, one sets the value of $\beta$
and calculates the expectation value (or other moments) of the entropy at this $\beta$. For small $N$, the
expectations will be very close to their a priori values due to the peakedness of $\mathcal{P}_\beta(S)$. Afterwards, one
integrates over $\beta(\xi)$ with the density $\rho(\xi)$, which includes our a priori expectations about the entropy of
the distribution we are studying [$\mathcal{P}(\beta(\xi))$], as well as the evidence for a particular value of $\beta$ [$\Gamma$-terms
in Eq. (13)].

The crucial point is the behavior of the evidence. If it has a pronounced peak at some $\beta_{cl}$, then the
integrals over $\beta$ are dominated by the vicinity of the peak, $\hat{S}$ is close to $\xi(\beta_{cl})$, and the variance of the
estimator is small. In other words, data "selects" some value of $\beta$, much in the spirit of Refs. [?].
However, this scenario may fail in two ways. First, there may be no peak in the evidence; this will result
in a very wide posterior and poor inference. Second, the posterior density may be dominated by $\beta \to 0$,
which corresponds to MLE, the best possible fit to the data, and is a discrete analog of overfitting. While
all these situations are possible, we claim that generically the evidence is well-behaved. Indeed, while
small $\beta$ increases the fit to the data, it also increases the phase space volume of all allowed distributions
and thus decreases probability of each particular one [remember that $\langle q_i \rangle_\beta$ has an extra $\beta$ counts in each
bin, thus distributions with $q_i < \beta/(N + \kappa)$ are strongly suppressed]. The fight between the "goodness
of fit" and the phase space volume should then result in some non-trivial $\beta_{cl}$, set by factors $\propto N$ in the
exponent of the integrand.
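
The averaging over $\beta$ can be sketched with a simple grid. This is an illustration under explicit assumptions, not the authors' implementation: the fixed-$\beta$ posterior mean is taken as $\langle S \rangle_\beta = \psi_0(N+\kappa+1) - \sum_i \frac{n_i+\beta}{N+\kappa}\psi_0(n_i+\beta+1)$ in nats, the evidence is the ratio of Gamma functions $\Gamma(\kappa)/\Gamma(N+\kappa)\prod_i \Gamma(n_i+\beta)/\Gamma(\beta)$, the prior is flat in $\xi$, and the integral is approximated on a log-spaced $\beta$ grid. All function names and the grid limits are hypothetical.

```python
import math

def digamma(x):
    """psi_0(x), x > 0 (recurrence plus asymptotic series)."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """psi_1(x), x > 0 (recurrence plus asymptotic series)."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def mean_entropy_fixed_beta(counts, beta):
    """Posterior mean entropy (nats) under one Dirichlet prior with exponent beta."""
    N, K = sum(counts), len(counts)
    kappa = K * beta
    return digamma(N + kappa + 1.0) - sum(
        (n + beta) / (N + kappa) * digamma(n + beta + 1.0) for n in counts)

def log_evidence(counts, beta):
    """log of Gamma(kappa)/Gamma(N+kappa) * prod_i Gamma(n_i+beta)/Gamma(beta)."""
    N, K = sum(counts), len(counts)
    kappa = K * beta
    out = math.lgamma(kappa) - math.lgamma(N + kappa)
    # Empty bins contribute a factor of one and are skipped.
    out += sum(math.lgamma(n + beta) - math.lgamma(beta) for n in counts if n > 0)
    return out

def mixture_entropy_bits(counts, n_grid=400, log10_lo=-4.0, log10_hi=4.0):
    """Entropy estimate (bits) averaged over beta with a prior flat in xi."""
    K = len(counts)
    step = (log10_hi - log10_lo) / (n_grid - 1)
    betas = [10.0 ** (log10_lo + step * j) for j in range(n_grid)]
    logs = [log_evidence(counts, b) for b in betas]
    top = max(logs)
    num = den = 0.0
    for b, l in zip(betas, logs):
        dxi = K * trigamma(K * b + 1.0) - trigamma(b + 1.0)  # d(xi)/d(beta)
        # Extra factor b: on a log-spaced grid, d(beta) = beta * d(ln beta).
        w = math.exp(l - top) * max(dxi, 0.0) * b
        num += w * mean_entropy_fixed_beta(counts, b)
        den += w
    return num / den / math.log(2.0)

if __name__ == "__main__":
    print(mixture_entropy_bits([250, 250, 250, 250]))  # well sampled, 4 equal bins
    print(mixture_entropy_bits([10, 0, 0, 0]))         # all samples in one bin
```

The grid is only a convenience; where the evidence is sharply peaked, a finer grid around the peak, or a proper quadrature in $\xi$, would be the more careful choice.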

Figure 4 shows how the prior, Eq. (11), performs on some of the many distributions we tested. The
left panel describes learning of distributions that are typical in the prior $\mathcal{P}_\beta(\{q_i\})$ and, therefore, are
also likely in $\mathcal{P}(\{q_i\}; \beta)$. Thus we may expect a reasonable performance, but the real results exceed all
expectations: for all three cases, the actual relative error drops to the 10% level at $N$ as low as 30 (recall
that $K = 1000$, so we only have about 0.03 data points per bin on average)! To put this in perspective,
simple estimates like fixed $\beta$ ones, MLE, and MLE corrected as in Eq. (10) with $K^*$ equal to the number
of nonzero $n_i$'s produce an error so big that it puts them off the axes until $N > 100$.† Our results have
two more nice features: the estimator seems to know its error pretty well, and it is almost completely
unbiased.

One might be puzzled at how it is possible to estimate anything in a 1000-bin distribution with just a
few samples: the distribution is completely unspecified for low $N$! The point is that we are not trying to
learn the distribution (in the absence of additional prior information this would, indeed, take $N \gg K$)
but to estimate just one of its characteristics. It is less surprising that one number can be learned well
with only a handful of measurements. In practice the algorithm builds its estimate based on the number
of coinciding samples (multiple coincidences are likely only for small $\beta$), as in Ma's approach to
entropy estimation from simulations of physical systems [?].

†More work is needed to compare our estimator to more complex techniques, like in Refs. [?].

Figure 4: Learning entropies with the prior Eq. (11) and $\mathcal{P}(\beta) = 1$. The actual relative errors of the
estimator are plotted; the error bars are the relative widths of the posteriors. (a) Distributions from
Fig. 1. (b) Distributions atypical in the prior. Note that while $\hat{S}$ may be safely calculated as just $\langle S \rangle_{\beta_{cl}}$,
one has to do an honest integration over $\beta$ to get $\widehat{S^2}$ and the error bars. Indeed, since $\mathcal{P}_\beta(S)$ is almost a
$\delta$-function, the uncertainty at any fixed $\beta$ is very small (see Fig. 3).

What will happen if the algorithm is fed with data from a distribution $\{\tilde{q}_i\}$ that is strongly atypical in
$\mathcal{P}(\{q_i\}; \beta)$? Since there is no $\{\tilde{q}_i\}$ in our prior, its estimate may suffer. Nonetheless, for any $\{\tilde{q}_i\}$, there is
some $\beta$ which produces distributions with the same mean entropy as $S[\tilde{q}_i]$. Such $\beta$ should be determined
in the usual fight between the "goodness of fit" and the Occam factors, and the correct value of entropy
will follow. However, there will be an important distinction from the "correct prior" cases. The value of $\beta$
indexes available phase space volumes, and thus the smoothness (complexity) of the model class. In
the case of discrete distributions, smoothness is the absence of high peaks. Thus data with faster decaying
Zipf plots (plots of bins' occupancy vs. occupancy rank $i$) are rougher. The priors $\mathcal{P}_\beta(\{q_i\})$ cannot
account for all possible roughnesses. Indeed, they only generate distributions for which the expected
number of bins $\nu$ with the probability mass less than some $q$ is given by $\nu(q) = K B(q; \beta, \kappa - \beta)$, where
$B$ is the familiar incomplete Beta function, as in Eq. (5). This means that the expected rank ordering for
small and large ranks is

$$q_i \approx 1 - \left[\frac{\beta B(\beta, \kappa - \beta)(K - 1)\, i}{K}\right]^{1/(\kappa - \beta)}, \qquad i \ll K, \tag{14}$$

$$q_i \approx \left[\frac{\beta B(\beta, \kappa - \beta)(K - i + 1)}{K}\right]^{1/\beta}, \qquad K - i + 1 \ll K. \tag{15}$$

In an undersampled regime we can observe only the first of the behaviors. Therefore, any distribution
with $q_i$ decaying faster (rougher) or slower (smoother) than Eq. (14) for some $\beta$ cannot be explained well
with fixed $\beta_{cl}$ for different $N$. So, unlike in the cases of learning data that are typical in $\mathcal{P}_\beta(\{q_i\})$, we
should expect to see $\beta_{cl}$ growing (falling) for qualitatively smoother (rougher) cases as $N$ grows.

Figure 4(b) and Tbl. 1 illustrate these points. First, we study the $\beta = 0.02$ distribution from Fig. 1.
However, we added 1000 extra bins, each with $q_i = 0$. Our estimator performs remarkably well,
and $\beta_{cl}$ does not drift because the ranking law remains the same. Then we turn to the famous Zipf's
distribution, so common in Nature. It has $n_i \propto 1/i$, which is qualitatively smoother than our prior
allows. Correspondingly, we get an upwards drift in $\beta_{cl}$. Finally, we analyze a "rough" distribution,
which has $q_i \propto 50 - 4(\ln i)^2$, and $\beta_{cl}$ drifts downwards. Clearly, one would want to predict the
dependence $\beta_{cl}(N)$ analytically, but this requires calculation of the predictive information (complexity) for the
involved distributions [?] and is a work for the future. Notice that the entropy estimator for atypical
cases is almost as good as for typical ones. A possible exception is the
100-1000 points for the Zipf distribution: they are about two standard
deviations off. We saw similar effects in some other "smooth" cases also.
This may be another manifestation of an observation made in Ref. [?]:
smooth priors can easily adapt to rough distribution, but there is a limit
to the smoothness beyond which rough priors become inaccurate.

To summarize, an analysis of a priori entropy statistics in common 
power-law Bayesian estimators revealed some very undesirable features. 
We are fortunate, however, that these minuses can be easily turned into 
pluses, and the resulting estimator of entropy is precise, knows its own 
error, and gives amazing results for a very large class of distributions. 



Table 1: $\beta_{cl}$ for solutions shown on Fig. 4(b).

N        1/2 full   Zipf     rough
units
10       1.7        1907     16.8
30       2.2        0.99     11.5
100      2.4        0.86     12.9
300      2.2        1.36     8.3
1000     2.1        2.24     6.4
3000     1.9        3.36     5.4
10000    2.0        4.89     4.5